Add: Worker-level chip bootstrap orchestration for distributed L3 #613

Merged
ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:feat/worker-chip-bootstrap-L6
Apr 21, 2026
@ChaoWao ChaoWao (Collaborator) commented Apr 21, 2026

Summary

Wires ChipWorker.bootstrap_context into the Worker factory so an L3 Worker(level>=3, chip_bootstrap_configs=[...]) brings up every chip child's communicator during init() and surfaces a ChipContext list to orch code before the first run().

Note on terminology: the L3 in the title refers to runtime hierarchy Level 3 (chip-level Worker); see docs/hierarchical_level_runtime.md. Earlier split commits (#608, #610) used (L2)/(L5) tags as split-step labels, which collide with the Level-0..Level-6 hierarchy; this PR drops those labels from new and touched code.

  • ChipContext dataclass in task_interface: device_id / rank / nranks / device_ctx / local_window_base / actual_window_size / buffer_ptrs: dict[str, int]. The per-buffer dict is built by zipping cfg.buffers with the result's buffer_ptrs, so orch code addresses a named window slice without tracking list indices. A length check before the zip raises RuntimeError on a parent/child buffer-count mismatch instead of silently truncating.
  • Parent: per-chip ChipBootstrapChannel mailbox (4096 B shared-memory, zero-filled so state starts IDLE) allocated pre-fork. Parent polls each channel with time.sleep(0.001) + 120 s soft timeout; on the first ERROR raises RuntimeError(f"chip {idx} bootstrap failed: {channel.error_message}") and best-effort SIGKILLs every forked child + unlinks every shm so init() raises cleanly without leaking state. chip_contexts is a property that raises before init().
  • Child: new _chip_process_loop_with_bootstrap runs bootstrap_context first (channel publishes SUCCESS/ERROR), then enters the same task/control poll loop as _chip_process_loop. try/finally runs shutdown_bootstrap then finalize on SHUTDOWN. Bootstrap failure returns via os._exit(0) so the parent's waitpid isn't confused by a non-zero exit code layered on top of the channel's error.
  • Teardown ordering: _worker.close() → SHUTDOWN → waitpid → unlink sub/chip/next-level mailboxes → bootstrap mailboxes unlinked last, because chip children touch their ChipBootstrapChannel inside shutdown_bootstrap() + finalize().
  • The original _chip_process_loop and _Worker scheduler wiring are untouched; the bootstrap path is gated on a non-None chip_bootstrap_configs argument and runs eagerly at init() time instead of the usual lazy _start_hierarchical() on first run().
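
The length check before the zip, described in the ChipContext bullet above, can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the field names follow the summary, while the helper name `build_buffer_ptrs`, the field types, and the error wording are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ChipContext:
    """Illustrative stand-in; field names follow the PR summary, types are assumed."""
    device_id: int
    rank: int
    nranks: int
    device_ctx: int
    local_window_base: int
    actual_window_size: int
    buffer_ptrs: dict[str, int] = field(default_factory=dict)


def build_buffer_ptrs(buffer_names: list[str], ptrs: list[int]) -> dict[str, int]:
    """Hypothetical helper: map buffer names to pointers, refusing to truncate."""
    # zip() stops at the shorter input, so a count mismatch would otherwise
    # silently drop entries; check the lengths first and fail loudly.
    if len(buffer_names) != len(ptrs):
        raise RuntimeError(
            f"buffer count mismatch: parent declared {len(buffer_names)}, "
            f"child reported {len(ptrs)}"
        )
    return dict(zip(buffer_names, ptrs))


ctx = ChipContext(
    device_id=0, rank=0, nranks=2, device_ctx=0x1000,
    local_window_base=0x2000, actual_window_size=4096,
    buffer_ptrs=build_buffer_ptrs(["x"], [0x2000]),
)
```

With this shape, orch code can look up `ctx.buffer_ptrs["x"]` by name instead of tracking the position of each buffer in the config list.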

Does not extend to level-4+ recursive Worker children — the _next_level_workers fork path is unchanged; adding distributed bring-up for nested Workers is a follow-up.
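
The parent-side wait described in the summary (poll every mailbox, fail on the first ERROR, give up after a timeout) might look something like the sketch below. It is not the PR's implementation: the one-byte state encoding at offset 0, the helper names, and the use of TimeoutError on timeout are all assumptions, and the real ChipBootstrapChannel also carries an error message and result fields that are omitted here.

```python
import time
from multiprocessing import shared_memory

# Hypothetical state encoding; the only property relied on is IDLE == 0,
# so that a zero-filled mailbox starts out IDLE.
IDLE, SUCCESS, ERROR = 0, 1, 2
MAILBOX_SIZE = 4096


def make_mailboxes(n):
    """Allocate n mailboxes (in the PR this happens before fork)."""
    boxes = [shared_memory.SharedMemory(create=True, size=MAILBOX_SIZE) for _ in range(n)]
    for box in boxes:
        box.buf[0] = IDLE  # explicit here; the PR relies on zero-fill
    return boxes


def release(boxes):
    """Close and unlink every mailbox so nothing leaks on any exit path."""
    for box in boxes:
        box.close()
        box.unlink()


def wait_for_bootstrap(boxes, timeout_s=120.0, poll_s=0.001):
    """Block until every mailbox reports SUCCESS; raise on ERROR or timeout."""
    deadline = time.monotonic() + timeout_s
    pending = set(range(len(boxes)))
    while pending:
        for idx in sorted(pending):
            state = boxes[idx].buf[0]
            if state == ERROR:
                raise RuntimeError(f"chip {idx} bootstrap failed")
            if state == SUCCESS:
                pending.discard(idx)
        if not pending:
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"bootstrap timed out; pending chips: {sorted(pending)}")
        time.sleep(poll_s)
```

A caller would wrap `wait_for_bootstrap` in try/except, kill and reap the children on failure, and call `release` last, which mirrors the teardown ordering in the summary.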

Testing

  • tests/ut/py/test_worker/test_worker_distributed_sim.py — happy path + error path (bogus placement triggers RuntimeError) + chip_contexts-before-init guard + __init__ validation (level<3 reject, length-mismatch reject).
  • tests/ut/py/test_worker/test_worker_distributed_hw.py — 2-card hardware smoke, drives Worker(level=3, chip_bootstrap_configs=[...]) end-to-end, asserts each rank's device_ctx != 0, local_window_base != 0, actual_window_size >= requested, and buffer_ptrs == {"x": local_window_base}. No comm_barrier — HCCL 507018 stays off the critical path. Lives under tests/ut so the ut-a2a3 job picks it up without xdist's per-worker device-slicing (which would break a 2-device request under tests/st).
  • pytest tests/ut/py/test_worker with chip_bootstrap_configs=None paths — 59 green, no regression.
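
The two guards exercised by the sim tests above (rejecting level<3 in `__init__`, and `chip_contexts` raising before `init()`) can be illustrated with a toy class. `WorkerSketch` is hypothetical and models only those guards; the exception types, the messages, and the placeholder context dicts are assumptions, not the behaviour of the real Worker.

```python
class WorkerSketch:
    """Hypothetical stand-in for Worker; only the two guards are modelled."""

    def __init__(self, level, chip_bootstrap_configs=None):
        # Assumed validation: bootstrap configs only make sense at level >= 3.
        if chip_bootstrap_configs is not None and level < 3:
            raise ValueError("chip_bootstrap_configs requires level >= 3")
        self._level = level
        self._configs = chip_bootstrap_configs
        self._chip_contexts = None  # None means init() has not run yet

    def init(self):
        # The real code forks chip children and waits for their bootstrap
        # results here; this sketch just fabricates one placeholder per config.
        configs = self._configs or []
        self._chip_contexts = [
            {"rank": rank, "nranks": len(configs)} for rank in range(len(configs))
        ]

    @property
    def chip_contexts(self):
        if self._chip_contexts is None:
            raise RuntimeError("chip_contexts is only available after init()")
        return self._chip_contexts
```

A test in the style of the sim suite would then assert that reading `chip_contexts` before `init()` raises, and that it returns one entry per config afterwards.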

Ref: #571 (split), builds on #608, #610.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces worker-level chip bootstrap orchestration (L6). It adds the ChipContext dataclass and updates the Worker class to support asynchronous bootstrap of chip children via shared-memory mailboxes. Key changes include a new child process loop that executes bootstrap_context, a timeout-based polling mechanism in the parent to collect results, and enhanced cleanup logic to prevent shared-memory leaks on failure. New hardware and simulation tests are also provided. Feedback is provided regarding a potential silent truncation issue when zipping buffer pointers, suggesting an explicit length check to ensure data integrity.

Comment thread python/simpler/worker.py
@ChaoWao ChaoWao force-pushed the feat/worker-chip-bootstrap-L6 branch 2 times, most recently from 43e439a to e70e464 on April 21, 2026 03:35
@ChaoWao ChaoWao changed the title from "Add: Worker-level chip bootstrap orchestration for distributed L3 (L6)" to "Add: Worker-level chip bootstrap orchestration for distributed L3" on Apr 21, 2026
- Add ChipContext dataclass in task_interface (device_id/rank/nranks +
  device_ctx, local_window_base, actual_window_size, buffer_ptrs: dict
  by name) — exposed to L3+ orch code after a successful bring-up
- Wire Worker(level>=3, chip_bootstrap_configs=[...]) so each chip child
  runs ChipWorker.bootstrap_context before entering the main task /
  control loop; parent blocks on a per-chip ChipBootstrapChannel until
  every chip reports SUCCESS, assembles ChipContexts, and fails fast on
  the first ERROR (best-effort SIGKILL + waitpid for the rest, shms
  unlinked so init() raises cleanly without leaking state)
- Explicit length check before zipping cfg.buffers with the channel's
  buffer_ptrs, so a parent/child buffer-count disagreement raises a
  descriptive RuntimeError instead of silently producing a truncated
  buffer_ptrs dict in the ChipContext
- Bootstrap mailboxes are allocated pre-fork (SharedMemory zero-fills
  -> IDLE) and unlinked *after* chip pids are reaped, since chip
  children touch the channel inside finalize()
- Drop stale split-step labels (L2/L5/L6) from new code and from prior
  chip_bootstrap docstrings since they collide with the runtime Level
  0-6 hierarchy documented in docs/hierarchical_level_runtime.md
- Add sim UT (happy path + error path + validation + chip_contexts-
  before-init guard) and hardware UT (2-card, no comm_barrier so the
  HCCL 507018 known-issue stays off the critical path)
@ChaoWao ChaoWao force-pushed the feat/worker-chip-bootstrap-L6 branch from e70e464 to 72a0f2a on April 21, 2026 07:58
@ChaoWao ChaoWao merged commit fa33039 into hw-native-sys:main Apr 21, 2026
14 checks passed
ChaoWao added a commit to PKUZHOU/simpler that referenced this pull request Apr 21, 2026
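Exercise end to end the distributed stack assembled from hw-native-sys#592
hw-native-sys#597 hw-native-sys#605 hw-native-sys#608 hw-native-sys#609
hw-native-sys#610 hw-native-sys#613.
Using Worker(level=3, chip_bootstrap_configs=...), each of the two cards
sums every rank's input across ranks via CommRemotePtr and MTE2, writes
the result back to its own output, and the output is read back with
worker.copy_from for verification.

Files:
- kernels/aiv/allreduce_kernel.cpp: copied directly from
  hw-native-sys#307 (PKUZHOU / echo_stone) with a single include path
  changed ("common/comm_context.h" → "platform_comm/comm_context.h") to
  match the header location after the L1b move.
- kernels/orchestration/allreduce_orch.cpp: passes the 5 scalars in
  ChipStorageTaskArgs (input_ptr, output_ptr, nranks, root, device_ctx)
  through to the AIV task unchanged, without Tensor wrapping (the Tensor
  path would rewrite the pointers).
- main.py: 2-card harness. Per-rank input is delivered into the window
  during bootstrap via SharedMemory + HostBufferStaging, and the shm is
  unlinked after init; for each chip, orch_fn calls add_scalar × 5 and
  submits to submit_next_level; copy_from reads the output back for
  verification.
- tests/st/workers_l3/test_allreduce_distributed_hw.py: marked with
  device_count(2) + platforms(["a2a3"]) so that st-onboard-a2a3 launches
  main() automatically.

WIP: only static checks (AST parse + import-name verification) were done
locally; this has not been compiled or run. The next step is to bring it
up on a 2-card a2a3 environment; the known points that still need
verification are listed in the PR body.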

Co-authored-by: echo_stone <liulei281@huawei.com>